
ggml: Implement yield barrier using futex for improved thread scheduling efficiency #13079


Open · SongXiaoXi wants to merge 1 commit into master from yield_barrier

Conversation

@SongXiaoXi (Contributor)

Description:
This PR replaces the original spin-based barrier in GGML with a futex-based yield barrier to improve thread scheduling efficiency and overall system performance (a minimal sketch of the mechanism is shown after the list of benefits below).

Currently, the feature can be controlled using the CMake parameter GGML_YIELD_BARRIER, allowing users to enable or disable the yield barrier as needed.

Key Benefits:

  1. Improved Scalability
    The futex-based barrier allows threads to yield instead of busy-waiting. This reduces CPU waste and improves scalability when the number of threads exceeds the number of physical cores, or when other workloads are competing for CPU time.

  2. Better Performance on Hybrid Architectures
    On systems with heterogeneous cores (e.g., big.LITTLE or Intel Hybrid Architecture), yielding helps critical threads get scheduled on performance cores, potentially improving throughput (e.g., PP performance in multi-threaded inference).

  3. Power Efficiency and Thermal Stability
    By avoiding unnecessary spinning, this change can reduce power consumption and help maintain higher sustained performance, especially on thermally constrained devices. It may also mitigate CPU throttling under load.
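
To make the mechanism concrete, here is a minimal sketch of what a futex-based barrier can look like on Linux. It illustrates the idea rather than the code in this PR: the names are made up, the actual patch spins briefly before sleeping, and non-Linux platforms need a different wait primitive.

```c
// Minimal sketch of a futex-based barrier (Linux-only, illustrative names;
// not the implementation in this PR). Waiting threads sleep in the kernel
// via futex() instead of busy-waiting, and the last thread to arrive wakes
// them all and starts the next "generation" of the barrier.
#include <stdatomic.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct {
    atomic_int n_arrived;   // threads that have reached the barrier this round
    atomic_int generation;  // round counter, also used as the futex word
    int        n_threads;   // number of participating threads
} yield_barrier;

static void yield_barrier_wait(yield_barrier * b) {
    int gen = atomic_load_explicit(&b->generation, memory_order_acquire);

    if (atomic_fetch_add_explicit(&b->n_arrived, 1, memory_order_acq_rel) + 1 == b->n_threads) {
        // last arrival: reset for the next round, publish the new generation, wake everyone
        atomic_store_explicit(&b->n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->generation, 1, memory_order_release);
        syscall(SYS_futex, &b->generation, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
    } else {
        // sleep until the generation advances; spurious wake-ups are re-checked
        while (atomic_load_explicit(&b->generation, memory_order_acquire) == gen) {
            syscall(SYS_futex, &b->generation, FUTEX_WAIT_PRIVATE, gen, NULL, NULL, 0);
        }
    }
}
```

The important property is that a thread with nothing to do gives its core back to the scheduler, which is what helps on oversubscribed or hybrid systems; the cost is a syscall and a context switch on the wake-up path, which is likely what shows up in some of the tg numbers below.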

Benchmark:

Based on build 42eb248 (5025).

Apple M1 (4P+4E), Accelerate framework and Metal disabled

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 488.30 ± 28.06 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 108.54 ± 19.58 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 824.37 ± 7.58 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 62.45 ± 0.14 |

Apple M3 Pro (5P+6E), Accelerate framework and Metal disabled

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 72.28 ± 0.39 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 11.89 ± 0.42 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 91.85 ± 1.59 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.84 ± 0.20 |

Apple M4 (binary compiled natively on M1)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | pp512 | 15.33 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.85 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | pp512 | 15.32 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.73 ± 0.00 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | pp512 | 27.93 ± 0.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | tg128 | 5.98 ± 0.08 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | pp512 | 28.38 ± 0.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | tg128 | 6.10 ± 0.00 |

Snapdragon 888 (X1 + A78x3 + A55x4)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 210.31 ± 3.34 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 39.36 ± 0.35 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 300.16 ± 5.45 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 14.33 ± 0.08 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 80.65 ± 8.72 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 8.05 ± 0.05 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 95.31 ± 1.67 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 6.45 ± 0.05 |

Snapdragon 6Gen1 (A78x4 + A55x4)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 196.30 ± 0.58 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 30.97 ± 0.17 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 261.19 ± 2.26 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 11.07 ± 0.11 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 79.43 ± 0.40 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 5.78 ± 0.04 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 79.56 ± 0.34 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 4.45 ± 0.01 |

Ryzen 9950X (light thermal throttling observed)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 216.12 ± 0.17 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 4.15 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 222.44 ± 2.12 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 4.15 ± 0.00 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | pp512 | 221.41 ± 2.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | tg128 | 3.94 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | pp512 | 222.19 ± 4.64 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | tg128 | 3.76 ± 0.04 |

Ryzen 9950X (spin-based bottleneck: threads > cores)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | pp512 | 59.36 ± 0.43 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | tg128 | 0.26 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | pp512 | 2052.45 ± 4.99 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | tg128 | 47.28 ± 1.20 |

Conclusion:

Across most tested devices, the pp512 workload consistently benefits from the futex-based yield barrier, showing noticeable throughput improvements. This is especially evident on high-core-count or hybrid-core systems, where reduced spinning improves scheduling fairness and efficiency.

However, for tg128 — which is typically less compute-intensive and more sensitive to load imbalance — performance may degrade slightly in some cases. This is likely due to the lower thread saturation and increased context switching overhead introduced by yielding, which affects lighter workloads more noticeably.

@github-actions bot added the **ggml** label (changes relating to the ggml tensor library for machine learning) on Apr 23, 2025
@SongXiaoXi (Contributor Author)

Hi, I would like to ask for your opinion regarding the use of futex-based yield barriers versus traditional spin barriers.
While yielding improves scalability and efficiency on overloaded systems or hybrid architectures, it may introduce additional context-switching overhead for lighter workloads.

Would appreciate your thoughts on whether a yield-based approach is a good fit for GGML’s threading model on mobile devices or servers under heavy load.

Thank you for your consideration!

@slaren (Member) commented Apr 29, 2025

Hi, sorry for taking so long to respond to this. I think this is very interesting and definitely something that we should be working towards, but as it is, the performance hit during generation is in my opinion too high for this to be useful in its current state. Ideally, this should be something that is always enabled and is triggered automatically after spinning for a while. I understand that the code already does this, so I wonder if it is a matter of tuning. If I am not mistaken, the gcc openmp implementation does something like this as well. Hiding this behind a compile flag that is disabled by default is likely to result in this being dead code that very few people are going to use.

On a less important note, I was not able to replicate the results on an M3 Max. In my tests, this was always slower.

```sh
cmake_opts="-DGGML_METAL=OFF -DGGML_BLAS=OFF -DGGML_OPENMP=OFF -DGGML_YIELD_BARRIER=ON" scripts/compare-commits.sh master yield_barrier -m models/qwen2.5-coder-0.5b-instruct-q4_0.gguf -p 256 -n 64 -r 2 -t 4,8,10,12,13,14,15,16 --delay 15
```

| Model | Threads | Test | t/s master | t/s yield_barrier | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 4 | pp256 | 504.27 | 494.29 | 0.98 |
| qwen2 1B Q4_0 | 4 | tg64 | 189.37 | 95.92 | 0.51 |
| qwen2 1B Q4_0 | 8 | pp256 | 958.59 | 902.25 | 0.94 |
| qwen2 1B Q4_0 | 8 | tg64 | 244.28 | 64.59 | 0.26 |
| qwen2 1B Q4_0 | 10 | pp256 | 1160.34 | 1073.05 | 0.92 |
| qwen2 1B Q4_0 | 10 | tg64 | 241.42 | 50.81 | 0.21 |
| qwen2 1B Q4_0 | 12 | pp256 | 1371.02 | 1229.41 | 0.90 |
| qwen2 1B Q4_0 | 12 | tg64 | 236.30 | 42.52 | 0.18 |
| qwen2 1B Q4_0 | 13 | pp256 | 1306.81 | 1192.42 | 0.91 |
| qwen2 1B Q4_0 | 13 | tg64 | 48.94 | 34.78 | 0.71 |
| qwen2 1B Q4_0 | 14 | pp256 | 1370.33 | 1190.25 | 0.87 |
| qwen2 1B Q4_0 | 14 | tg64 | 42.34 | 29.45 | 0.70 |
| qwen2 1B Q4_0 | 15 | pp256 | 1409.30 | 1177.26 | 0.84 |
| qwen2 1B Q4_0 | 15 | tg64 | 147.98 | 26.58 | 0.18 |
| qwen2 1B Q4_0 | 16 | pp256 | 1388.65 | 1183.83 | 0.85 |
| qwen2 1B Q4_0 | 16 | tg64 | 113.04 | 24.09 | 0.21 |

@SongXiaoXi (Contributor Author)

"If I am not mistaken, the gcc openmp implementation does something like this as well."

You're right, GCC’s OpenMP implementation does something similar. For reference, here are a few relevant links:

I've also implemented a check for the number of affinity cores to avoid unnecessary spinning — particularly helpful on processes limited by cpuset.
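
For illustration, a check along those lines could look roughly like this on Linux (the helper name is hypothetical, not taken from the PR):

```c
// Rough sketch of an affinity-aware spin decision (hypothetical helper, not
// the PR's code): if the process is confined to fewer CPUs than it has
// worker threads, spinning only steals time from the threads that hold the
// actual work, so the barrier should go to sleep right away.
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

static bool barrier_should_spin(int n_threads) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return true; // affinity unknown: keep the default spinning behaviour
    }
    // spin only if every worker thread can run on its own CPU at the same time
    return CPU_COUNT(&set) >= n_threads;
}
```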

"I was not able to replicate the results on an M3 Max. In my tests, this was always slower."

Regarding your M3 Max (12P+4E, I guess) results:

  • A long-tail effect: once the 12 performance cores complete their tasks, only 4 of them are reassigned to handle the remaining workload from the slower efficiency cores, leaving the other 8 P-cores idle.
  • The tg phase in qwen2 1B Q4_0 is not compute-intensive enough, making thread scheduling overhead more noticeable.

"Hiding this behind a compile flag ... likely to result in dead code"

Completely agree. The goal is absolutely to make yield_barrier the default in the future. The flag is just a temporary measure while we sort out tuning for cases where generation throughput suffers significantly.

Below is a set of benchmark results from my M3 Pro (5P+6E). It shows that pp512 and pp256 consistently benefit from yield_barrier, while tg128 and tg64 performance drops, especially at higher thread counts — supporting the idea that automatic tuning (e.g., adjusting threads per phase) might be a better long-term solution. According to your results, even with the spin policy, the best performance is achieved with 8 threads, not more.
To fully support hybrid-core CPUs more efficiently, it might be worth considering a work-stealing task queue — but that is still a long way off.

| Model | Threads | Test | t/s master | t/s yield_barrier | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 5 | pp512 | 1233.17 | 1207.73 | 0.98 |
| qwen2 1B Q4_0 | 5 | tg128 | 221.44 | 96.06 | 0.43 |
| qwen2 1B Q4_0 | 6 | pp512 | 700.10 | 941.10 | 1.34 |
| qwen2 1B Q4_0 | 6 | tg128 | 202.61 | 64.65 | 0.32 |
| qwen2 1B Q4_0 | 7 | pp512 | 766.82 | 1076.56 | 1.40 |
| qwen2 1B Q4_0 | 7 | tg128 | 210.89 | 57.33 | 0.27 |
| qwen2 1B Q4_0 | 8 | pp512 | 865.30 | 1185.22 | 1.37 |
| qwen2 1B Q4_0 | 8 | tg128 | 210.73 | 55.28 | 0.26 |
| qwen2 1B Q4_0 | 9 | pp512 | 931.69 | 1262.22 | 1.35 |
| qwen2 1B Q4_0 | 9 | tg128 | 206.39 | 45.75 | 0.22 |
| qwen2 1B Q4_0 | 10 | pp512 | 973.63 | 1308.40 | 1.34 |
| qwen2 1B Q4_0 | 10 | tg128 | 192.60 | 45.90 | 0.24 |
| qwen2 1B Q4_0 | 11 | pp512 | 888.40 | 1244.00 | 1.40 |
| qwen2 1B Q4_0 | 11 | tg128 | 150.34 | 41.88 | 0.28 |

| Model | Threads | Test | t/s master | t/s yield_barrier | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 5 | pp256 | 1442.20 | 1390.71 | 0.96 |
| qwen2 1B Q4_0 | 5 | tg64 | 208.26 | 99.09 | 0.48 |
| qwen2 1B Q4_0 | 6 | pp256 | 724.18 | 1072.99 | 1.48 |
| qwen2 1B Q4_0 | 6 | tg64 | 202.30 | 65.29 | 0.32 |
| qwen2 1B Q4_0 | 7 | pp256 | 805.83 | 1197.01 | 1.49 |
| qwen2 1B Q4_0 | 7 | tg64 | 209.09 | 59.03 | 0.28 |
| qwen2 1B Q4_0 | 8 | pp256 | 937.62 | 1306.49 | 1.39 |
| qwen2 1B Q4_0 | 8 | tg64 | 211.49 | 55.08 | 0.26 |
| qwen2 1B Q4_0 | 9 | pp256 | 1026.91 | 1388.99 | 1.35 |
| qwen2 1B Q4_0 | 9 | tg64 | 204.45 | 49.90 | 0.24 |
| qwen2 1B Q4_0 | 10 | pp256 | 1044.71 | 1457.37 | 1.39 |
| qwen2 1B Q4_0 | 10 | tg64 | 180.48 | 45.55 | 0.25 |
| qwen2 1B Q4_0 | 11 | pp256 | 942.47 | 1322.39 | 1.40 |
| qwen2 1B Q4_0 | 11 | tg64 | 143.43 | 42.41 | 0.30 |

@slaren (Member) commented Apr 30, 2025

> To fully support hybrid-core CPUs more efficiently, it might be worth considering a work-stealing task queue

Wouldn't this be the same that is already implemented for mul_mat and mul_mat_id?

```c
current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
```

However, this is not supported when repacking Q4_0, since it uses a different implementation of the matrix multiplication functions.

```cpp
bool compute_forward(struct ggml_compute_params * params, struct ggml_tensor * op) override {
```
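
In sketch form, that chunk scheduling is essentially the following (a simplified illustration of the chunk hand-out, not the exact ggml code):

```c
// Simplified illustration of the chunk hand-out used by ggml's mul_mat (not
// a verbatim copy): every thread atomically claims the next chunk index, so
// faster cores naturally process more chunks than slower ones.
#include <stdatomic.h>

typedef struct {
    atomic_int current_chunk; // next chunk index to hand out
    int        n_chunks;      // total number of chunks for this op
} chunk_queue;

static void process_chunks(chunk_queue * q, void (*do_chunk)(int chunk)) {
    for (;;) {
        int chunk = atomic_fetch_add_explicit(&q->current_chunk, 1, memory_order_relaxed);
        if (chunk >= q->n_chunks) {
            break; // no work left: proceed to the barrier
        }
        do_chunk(chunk);
    }
}
```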

@SongXiaoXi (Contributor Author) commented Apr 30, 2025

> Wouldn't this be the same that is already implemented for mul_mat and mul_mat_id?

Ah, I see — I missed this part, you're right. That does implement a similar chunk-level scheduling mechanism.

So the performance regressions I'm seeing probably come down to the spin count not being tuned well — a smarter, adaptive spin-wait threshold is likely needed to reduce the cost of falling back to futex() syscalls. I'll do more testing.
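
In rough sketch form, the kind of spin-then-sleep wait being discussed looks like this (Linux-only; the spin budget is an assumed constant and is precisely the part that would need adaptive tuning):

```c
// Illustrative spin-then-sleep wait (hypothetical, not code from this PR).
// Spin for a bounded number of iterations, and only fall back to the
// futex() syscall if the watched word still has not changed.
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_BEFORE_SLEEP 4096 // assumed tuning constant, the value that needs adapting

static void spin_then_wait(atomic_int * word, int observed) {
    for (int i = 0; i < SPIN_BEFORE_SLEEP; ++i) {
        if (atomic_load_explicit(word, memory_order_acquire) != observed) {
            return; // value changed while spinning: no syscall needed
        }
        // a cpu "pause"/yield hint would normally go here
    }
    while (atomic_load_explicit(word, memory_order_acquire) == observed) {
        syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, observed, NULL, NULL, 0);
    }
}
```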
